========================================================

This report explores prosper loan data collected from Prosper a peer to peer lending company. The data contains various information in 81 variables on 113,937 borrowers from November 2005 to March 2014. The company took off in 2005 and in 2009 completely changed their underwriting process, to improve their services. The data was analysed from different angles using single, bi or multi variable

Univariate Plots Section

## [1] 113937     81

There are 113937 rows and 82 columns in the dataset. Each row containing information of a unique borrower.

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

we convert the date on which listing was created from factor to date. And find the start and end date of the dataset.

## [1] "2005-11-09"
## [1] "2014-03-10"

From the above plot we can see the distribution of the listings over the years. The number of listings increased from 2006 to 2008. But reached it’s minimum in 2009. After 2009 the number of borrowers have continued to increase every year. This is an indication that the business continued to grow. For 2014 we only have about 2 months of data.

The dip in 2009 was for the company changing it’s underwriting process not due to an economic condition.

Let’s check that if there are records on 1st August in 2009 (new system rolled after July 2009)

## [1] TRUE

Prosper started their new system after July 2009 so 1st August 2009.

## # A tibble: 9 x 3
##   CreditGrade     n precent
##   <fct>       <int>   <dbl>
## 1 ""          84984  74.6  
## 2 A            3315   2.91 
## 3 AA           3509   3.08 
## 4 B            4389   3.85 
## 5 C            5649   4.96 
## 6 D            5153   4.52 
## 7 E            3289   2.89 
## 8 HR           3508   3.08 
## 9 NC            141   0.124

Finding the percent of each credit grade, we can see that, 74.6% of the records have borrowed after the Prosper Rating was introduced after July 2009. So they don’t have a credit grade. We will mainly be working with this data set.

##   no_borrowers_in_2009
## 1                 2206

In 2009 there were 2206 borrowers

##   no_borrowers_from_August_2009  percent
## 1                          1997 90.52584

So we see that in 2009 about 90.5 % of that year’s borrowers enrolled under the new system. So this seems to be the reason why there are less count in this year. Prosper spent 6 months of the year in improving their system/services.

Summary of LoanOriginalAmount:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

From the summary of loan original amount we see that the loan range is 1000-35000 dollars. With 75% Borrowers getting loan between 1000-12000 dollars. The median and mean being 6500 and 8337 dollars respectively.

The distribution of loan original amount shows that popular amount of loans are 1000-5000, 9000-10000, 14000-15000. This is in line with my assumption that there will be more loans of lower amount than higher amounts. And also in line with the above summary.

From the percent distribution of credit grades We see that B, C, and D credit grades was popular for lending before August 2009.

We can see under the new system, majority of Prosper Borrowers have a Rating of A-D.So we see that introduction of the new system has made more groups popular than before.

It would be interesting to see later if the proportion of bad borrowers (chargedoff, defaulted) changed over time.

We want to know for what purpose Prosper borrowers take loans. From the above bar plot we see that debt consolidation is the most popular reason for taking loans from Prosper. Given that the loan range is from 1000-35000 that makes sense. It would also be interesting to see the loan amounts borrowers take for debt consolidation.

We can see the loan amounts for debt consolidation spans from 1000 all the way till 35000 dollars. Popular amounts are 3000-4000, 10000-11000, 15000-16000.

We see that 36 months is the most popular loan term.

Most borrowers are employed.

## # A tibble: 2 x 3
##   IsBorrowerHomeowner     n precent
##   <fct>               <int>   <dbl>
## 1 False               56459    49.6
## 2 True                57478    50.4

Looks like the ratio of home owners to not home owners is almost 50-50.

Next we will look at the percentage of different loan status groups:

Most people belong to “Current” loan status. Followed by “Completed” status. We will later see what listing years the “Current” group belongs. I am assuming 2013-2014 mostly.

We next check what percent of the current group are falling behind in their payment. We filter out the current loan status group where TotalProsperPaymentsBilled > OnTimeProsperPayments.

##      n  percent
## 1 1775 3.137373

3.13% of “Current” loan status borrowers might be falling short on their payments.

Summary of StatedMonthlyIncome:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003

From the above summary results of stated monthly income we see that 75% Borrowers earn less or equal to 6825 dollars. The median value being 4667 and the mean being 5608 dollars. In the histogram plot of stated monthly income, We set limits on X-axis till 7000 dollar as otherwise we were getting just one bar at the very left end.

Summary of DebtToIncomeRatio:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

The histogram plot of debt to income ratio is right skewed, which is expected as lower debt to income ratio is a factor that would help borrowers get loan from borrowers. From the summary results we see that 75% borrowers have debt to income ratio less or equal to 0.32

Summary of EstimatedReturn:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.183   0.074   0.092   0.096   0.117   0.284   29084

The estimated return distribution ranges from -0.183 to 0.284. With the median being 0.092.

Tha above results make me wonder how an estimated return can be negative. So, we filter out records where EstimatedReturn<0 and see the summary as below:

##                    ListingKey  ListingNumber    ListingCreationDate 
##  009B3495741061236C0C927:  1   Min.   :416595   Min.   :2009-07-14  
##  02283482060850607BC27FE:  1   1st Qu.:433575   1st Qu.:2009-11-16  
##  02933496318097934A683F4:  1   Median :457278   Median :2010-05-07  
##  029A3499706893097DB1393:  1   Mean   :456968   Mean   :2010-05-02  
##  038535015703591930CD2E0:  1   3rd Qu.:478133   3rd Qu.:2010-10-04  
##  03AA34675425410661C8213:  1   Max.   :488374   Max.   :2010-12-18  
##  (Other)                :189                                        
##   CreditGrade       Term                     LoanStatus 
##         :195   Min.   :36   Completed             :148  
##  A      :  0   1st Qu.:36   Chargedoff            : 44  
##  AA     :  0   Median :36   Defaulted             :  3  
##  B      :  0   Mean   :36   Cancelled             :  0  
##  C      :  0   3rd Qu.:36   Current               :  0  
##  D      :  0   Max.   :36   FinalPaymentInProgress:  0  
##  (Other):  0                (Other)               :  0  
##                ClosedDate   BorrowerAPR       BorrowerRate   
##  2013-05-18 00:00:00:  3   Min.   :0.08999   Min.   :0.0693  
##  2010-02-03 00:00:00:  2   1st Qu.:0.21352   1st Qu.:0.1845  
##  2010-10-15 00:00:00:  2   Median :0.29791   Median :0.2600  
##  2011-03-01 00:00:00:  2   Mean   :0.29621   Mean   :0.2630  
##  2011-04-14 00:00:00:  2   3rd Qu.:0.37453   3rd Qu.:0.3500  
##  2011-04-28 00:00:00:  2   Max.   :0.41355   Max.   :0.3500  
##  (Other)            :182                                     
##   LenderYield     EstimatedEffectiveYield EstimatedLoss   
##  Min.   :0.0593   Min.   :-0.18270        Min.   :0.0595  
##  1st Qu.:0.1745   1st Qu.:-0.06265        1st Qu.:0.1900  
##  Median :0.2500   Median :-0.03250        Median :0.3450  
##  Mean   :0.2530   Mean   :-0.03859        Mean   :0.2865  
##  3rd Qu.:0.3400   3rd Qu.:-0.01680        3rd Qu.:0.3660  
##  Max.   :0.3400   Max.   : 0.31990        Max.   :0.3660  
##                                                           
##  EstimatedReturn    ProsperRating..numeric. ProsperRating..Alpha.
##  Min.   :-0.18270   Min.   :1.000           AA:  0               
##  1st Qu.:-0.06265   1st Qu.:1.000           A :  0               
##  Median :-0.03540   Median :1.000           B :  1               
##  Mean   :-0.04797   Mean   :1.118           C :  3               
##  3rd Qu.:-0.01865   3rd Qu.:1.000           D :  2               
##  Max.   :-0.00060   Max.   :5.000           E :  6               
##                                             HR:183               
##   ProsperScore   ListingCategory..numeric. BorrowerState
##  Min.   :1.000   Min.   :1.000             CA     : 33  
##  1st Qu.:1.000   1st Qu.:1.000             MI     : 14  
##  Median :1.000   Median :2.000             FL     : 13  
##  Mean   :2.385   Mean   :3.272             GA     : 13  
##  3rd Qu.:4.000   3rd Qu.:7.000             CO     :  9  
##  Max.   :9.000   Max.   :7.000             TX     :  9  
##                                            (Other):104  
##              Occupation      EmploymentStatus EmploymentStatusDuration
##  Other            :51   Full-time    :108     Min.   :  0.0           
##  Professional     :24   Employed     : 37     1st Qu.: 26.0           
##  Military Enlisted:15   Self-employed: 33     Median : 68.0           
##  Teacher          :10   Retired      :  8     Mean   : 84.1           
##  Sales - Retail   : 7   Not employed :  5     3rd Qu.:123.0           
##  Analyst          : 6   Part-time    :  4     Max.   :303.0           
##  (Other)          :82   (Other)      :  0                             
##  IsBorrowerHomeowner CurrentlyInGroup                    GroupKey  
##  False: 94           False:147                               :147  
##  True :101           True : 48        3D4D3366260257624AB272D:  7  
##                                       52EA3425051368132B80C96:  6  
##                                       C48E3419578591276DB6752:  4  
##                                       C95834619831959805D1458:  4  
##                                       87AF336457453170468EEA4:  3  
##                                       (Other)                : 24  
##             DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper
##  2009-07-14 07:40:03:  1     Min.   :600.0         Min.   :619.0        
##  2009-07-17 13:41:18:  1     1st Qu.:620.0         1st Qu.:639.0        
##  2009-07-19 06:49:16:  1     Median :640.0         Median :659.0        
##  2009-07-30 15:01:34:  1     Mean   :656.8         Mean   :675.8        
##  2009-07-31 15:40:14:  1     3rd Qu.:680.0         3rd Qu.:699.0        
##  2009-08-08 13:50:22:  1     Max.   :820.0         Max.   :839.0        
##  (Other)            :189                                                
##         FirstRecordedCreditLine CurrentCreditLines OpenCreditLines 
##  1988-03-01 00:00:00:  3        Min.   : 0.000     Min.   : 0.000  
##  1982-06-08 00:00:00:  2        1st Qu.: 5.000     1st Qu.: 4.000  
##  1985-04-01 00:00:00:  2        Median : 9.000     Median : 7.000  
##  1991-11-01 00:00:00:  2        Mean   : 9.287     Mean   : 7.923  
##  1992-02-01 00:00:00:  2        3rd Qu.:12.000     3rd Qu.:11.000  
##  1993-05-18 00:00:00:  2        Max.   :34.000     Max.   :30.000  
##  (Other)            :182                                           
##  TotalCreditLinespast7years OpenRevolvingAccounts
##  Min.   : 3.00              Min.   : 0.000       
##  1st Qu.:17.00              1st Qu.: 2.000       
##  Median :27.00              Median : 4.000       
##  Mean   :29.25              Mean   : 5.092       
##  3rd Qu.:40.50              3rd Qu.: 7.000       
##  Max.   :81.00              Max.   :23.000       
##                                                  
##  OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries  
##  Min.   :   0.0              Min.   : 0.000       Min.   : 0.000  
##  1st Qu.:  40.0              1st Qu.: 1.000       1st Qu.: 4.000  
##  Median : 131.0              Median : 2.000       Median : 8.000  
##  Mean   : 245.4              Mean   : 3.031       Mean   : 9.877  
##  3rd Qu.: 333.0              3rd Qu.: 4.000       3rd Qu.:14.000  
##  Max.   :2483.0              Max.   :17.000       Max.   :46.000  
##                                                                   
##  CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
##  Min.   : 0.000       Min.   :     0   Min.   : 0.000         
##  1st Qu.: 0.000       1st Qu.:     0   1st Qu.: 0.000         
##  Median : 0.000       Median :     0   Median : 0.000         
##  Mean   : 1.308       Mean   :  1866   Mean   : 5.426         
##  3rd Qu.: 1.000       3rd Qu.:   444   3rd Qu.: 6.000         
##  Max.   :17.000       Max.   :101497   Max.   :99.000         
##                                                               
##  PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
##  Min.   :0.0000           Min.   :0.00000           Min.   :     0.0      
##  1st Qu.:0.0000           1st Qu.:0.00000           1st Qu.:   743.5      
##  Median :0.0000           Median :0.00000           Median :  4021.0      
##  Mean   :0.3538           Mean   :0.02564           Mean   :  9027.7      
##  3rd Qu.:1.0000           3rd Qu.:0.00000           3rd Qu.: 10771.5      
##  Max.   :4.0000           Max.   :1.00000           Max.   :131711.0      
##                                                                           
##  BankcardUtilization AvailableBankcardCredit  TotalTrades   
##  Min.   :0.0000      Min.   :    0.0         Min.   : 1.00  
##  1st Qu.:0.0850      1st Qu.:    7.5         1st Qu.:13.00  
##  Median :0.7800      Median :  238.0         Median :22.00  
##  Mean   :0.5925      Mean   : 2352.5         Mean   :23.91  
##  3rd Qu.:0.9550      3rd Qu.: 1535.0         3rd Qu.:33.00  
##  Max.   :1.2500      Max.   :36522.0         Max.   :73.00  
##                                                             
##  TradesNeverDelinquent..percentage. TradesOpenedLast6Months
##  Min.   :0.2000                     Min.   : 0.000         
##  1st Qu.:0.7450                     1st Qu.: 0.000         
##  Median :0.8600                     Median : 1.000         
##  Mean   :0.8345                     Mean   : 1.221         
##  3rd Qu.:1.0000                     3rd Qu.: 2.000         
##  Max.   :1.0000                     Max.   :10.000         
##                                                            
##  DebtToIncomeRatio         IncomeRange IncomeVerifiable
##  Min.   :0.0200    $25,000-49,999:70   False: 55       
##  1st Qu.:0.1900    $50,000-74,999:58   True :140       
##  Median :0.2800    $1-24,999     :21                   
##  Mean   :0.3156    $100,000+     :20                   
##  3rd Qu.:0.3625    $75,000-99,999:20                   
##  Max.   :2.2000    Not employed  : 5                   
##  NA's   :55        (Other)       : 1                   
##  StatedMonthlyIncome                    LoanKey    TotalProsperLoans
##  Min.   :    0       06FF36066150072340E9914:  1   Min.   :1.000    
##  1st Qu.: 2625       079C3578185442353B42558:  1   1st Qu.:1.000    
##  Median : 4167       096C3600123993142C20218:  1   Median :1.000    
##  Mean   : 4699       0AF1360230099663244AB4D:  1   Mean   :1.397    
##  3rd Qu.: 5767       0B8D35733283763066B9CA1:  1   3rd Qu.:2.000    
##  Max.   :33333       0C9B35732006798939C6189:  1   Max.   :4.000    
##                      (Other)                :189   NA's   :54       
##  TotalProsperPaymentsBilled OnTimeProsperPayments
##  Min.   : 1.00              Min.   : 1.00        
##  1st Qu.:16.00              1st Qu.:16.00        
##  Median :27.00              Median :26.00        
##  Mean   :29.47              Mean   :29.05        
##  3rd Qu.:36.00              3rd Qu.:36.00        
##  Max.   :76.00              Max.   :76.00        
##  NA's   :54                 NA's   :54           
##  ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
##  Min.   :0.0000                      Min.   :0.00000                
##  1st Qu.:0.0000                      1st Qu.:0.00000                
##  Median :0.0000                      Median :0.00000                
##  Mean   :0.4113                      Mean   :0.00709                
##  3rd Qu.:0.0000                      3rd Qu.:0.00000                
##  Max.   :6.0000                      Max.   :1.00000                
##  NA's   :54                          NA's   :54                     
##  ProsperPrincipalBorrowed ProsperPrincipalOutstanding
##  Min.   : 1000            Min.   :   0.00            
##  1st Qu.: 2000            1st Qu.:   0.01            
##  Median : 3500            Median : 674.12            
##  Mean   : 5918            Mean   :1094.09            
##  3rd Qu.: 7800            3rd Qu.:1556.02            
##  Max.   :35000            Max.   :8198.38            
##  NA's   :54               NA's   :54                 
##  ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
##  Min.   :-143.00             Min.   :   0.0           
##  1st Qu.:  -9.00             1st Qu.:   0.0           
##  Median :  18.00             Median :   0.0           
##  Mean   :  25.29             Mean   : 208.7           
##  3rd Qu.:  61.00             3rd Qu.:   0.0           
##  Max.   : 241.00             Max.   :1540.0           
##  NA's   :54                                           
##  LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination   LoanNumber   
##  Min.   : 1.00                 Min.   :39.00              Min.   :38061  
##  1st Qu.: 9.00                 1st Qu.:41.00              1st Qu.:39746  
##  Median :17.00                 Median :45.00              Median :42797  
##  Mean   :18.11                 Mean   :45.83              Mean   :42546  
##  3rd Qu.:24.00                 3rd Qu.:52.00              3rd Qu.:44845  
##  Max.   :40.00                 Max.   :56.00              Max.   :46381  
##  NA's   :148                                                             
##  LoanOriginalAmount          LoanOriginationDate LoanOriginationQuarter
##  Min.   : 1000      2010-10-13 00:00:00:  5      Q4 2010:67            
##  1st Qu.: 1500      2009-11-10 00:00:00:  4      Q4 2009:39            
##  Median : 2375      2009-11-18 00:00:00:  3      Q3 2010:25            
##  Mean   : 3337      2009-11-30 00:00:00:  3      Q2 2010:24            
##  3rd Qu.: 4000      2010-07-29 00:00:00:  3      Q1 2010:21            
##  Max.   :15000      2010-11-12 00:00:00:  3      Q3 2009:19            
##                     (Other)            :174      (Other): 0            
##                    MemberKey   MonthlyLoanPayment LP_CustomerPayments
##  1ACE3369930823000F335D1:  2   Min.   :  0.00     Min.   :    0      
##  233833917424416802837ED:  2   1st Qu.: 54.87     1st Qu.: 1445      
##  02193381466293394735726:  1   Median : 90.61     Median : 2525      
##  0303343173529627211CD29:  1   Mean   :134.50     Mean   : 3622      
##  0465336599513925256A8CC:  1   3rd Qu.:172.00     3rd Qu.: 4766      
##  0470339919173724675477C:  1   Max.   :678.37     Max.   :21302      
##  (Other)                :187                                         
##  LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees   
##  Min.   :    0                Min.   :   0.0     Min.   :-235.52  
##  1st Qu.: 1000                1st Qu.: 299.1     1st Qu.: -44.66  
##  Median : 1600                Median : 564.0     Median : -24.53  
##  Mean   : 2657                Mean   : 965.0     Mean   : -34.63  
##  3rd Qu.: 3500                3rd Qu.:1275.6     3rd Qu.: -13.01  
##  Max.   :13000                Max.   :8302.4     Max.   :   0.00  
##                                                                   
##  LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
##  Min.   :-793.04   Min.   :    0.0       Min.   :    0.0    
##  1st Qu.:   0.00   1st Qu.:    0.0       1st Qu.:    0.0    
##  Median :   0.00   Median :    0.0       Median :    0.0    
##  Mean   : -14.38   Mean   :  678.1       Mean   :  669.5    
##  3rd Qu.:   0.00   3rd Qu.:    0.0       3rd Qu.:    0.0    
##  Max.   :   0.00   Max.   :14579.6       Max.   :14579.6    
##                                                             
##  LP_NonPrincipalRecoverypayments PercentFunded Recommendations
##  Min.   :   0.00                 Min.   :1     Min.   :0.000  
##  1st Qu.:   0.00                 1st Qu.:1     1st Qu.:0.000  
##  Median :   0.00                 Median :1     Median :0.000  
##  Mean   :  22.26                 Mean   :1     Mean   :0.241  
##  3rd Qu.:   0.00                 3rd Qu.:1     3rd Qu.:0.000  
##  Max.   :2443.00                 Max.   :1     Max.   :4.000  
##                                                               
##  InvestmentFromFriendsCount InvestmentFromFriendsAmount   Investors     
##  Min.   :0.0                Min.   :   0.0              Min.   :  1.00  
##  1st Qu.:0.0                1st Qu.:   0.0              1st Qu.: 12.00  
##  Median :0.0                Median :   0.0              Median : 32.00  
##  Mean   :0.2                Mean   : 217.7              Mean   : 40.39  
##  3rd Qu.:0.0                3rd Qu.:   0.0              3rd Qu.: 60.50  
##  Max.   :3.0                Max.   :7000.0              Max.   :169.00  
##                                                                         
##  ListingCreationYear
##  Min.   :2009       
##  1st Qu.:2009       
##  Median :2010       
##  Mean   :2010       
##  3rd Qu.:2010       
##  Max.   :2010       
## 

From the summary we can see that majority of borrowers(183/195) of this group belong to the high risk group(HR) of prosper rating system. We can see the that the total inquiries is quiet lot from the whole data set with median value of 8 compared to a 4 for the whole data set. In terms of Bank card utilization we can see a median value of 0.78 compared to the 0.6 value for the whole data set. Other reasons including these might have contributed for a negative estimated return for this group.

Summary of BorrowerAPR:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25

Borrowers APR distribution ranged from 0.007 to 0.5 with median at 0.21. It would be interesting to look at the distribution of Borrower APR by ProsperRating. We expect the highest rating AA to have low APR and the lowest rating HR to have the highest APR.

Summary of TotalInquiries:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.000   4.000   5.584   7.000 379.000    1159

From the bar plot we can see that the plot is right skewed. With the count decreasing as total number of inquiries increased. The median value is 4 for total inquiries.

Summary of Delinquencies_Last7Years:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.155   3.000  99.000     990

Distribution of delinquencies in the last 7 years was plotted above. This plot also is right skewed. The X-axis was limited till 10 as 75% data was less or equal to 3. The median value is 0. So it’s an extremely rare event for prosper borrowers.

Summary of Bankcard Utilization:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.310   0.600   0.561   0.840   5.950    7604

Bank Card utilization is the percentage of available revolving credit that is utilized at the time the credit profile was pulled.The median value for this variable is 0.6. So we see a plot that has more than 75% within 1.0. We see the count increasing with increase in Bank card utilization till 0.9. Most of the Borrowers are borrowing money for debt consolidation, which might be the reason why we see count increasing with higher card utilization.

Summary of Trades opened in last 6 months:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.802   1.000  20.000    7544

The variable TradesOpenedLast6Months has a median value of 0 and a 3rd quartile value of 1. So 75% Borrowers will have no or just one trades opened in last 6 months.

Univariate Analysis

What is the structure of your dataset?

This dataset contains records from 113937 borrowers with 81 variables. These variables contain all information regarding the borrower and the loan. Such as, original amount of loan, the APR and interest rate, for how many months it has been sanctioned, the credit score range, the prosper score and rating, payments by borrowers etc.

A noteworty variable is ProsperRating..Alpha. that shows the prosper rating of the borrowers based on their proprietary rating system.Each letter grade corresponds to an estimated average annualized loss rate range.So it is an indicator of expected loss. The variable is an ordered factor variable with the following levels:

(low)———–>(High) in terms of risk.

AA, A, B, C, D, E, HR

Loan Status is another important variable, showing the current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.

Other important observations:

  • The loan range is from 1000-35000 dollars.
  • The prosper rating system was introduced after July, 2009
  • 77% borrowers borrow money for 36 months.
  • The APR has a range of 0.07-0.5.
  • Most borrowers use prosper for debt consolidation.
  • The estimated return can range from -0.183-0.284.
  • Most of the borrowers are employed.
  • 75% of the borrowers have a debt to income ratio of 0.32 or less.

What is/are the main feature(s) of interest in your dataset?

According to prosper.com prosper rating represents an estimated average annualized loss rate range to the investor. The Estimated Loss mainly depends on Prosper Score and Credit Score. Prosper Score is the custom risk score built by prosper. So I am interested in identifying the characteristics of the borrowers that determines the prosper score.

I am also interested in finding out if lending money on this platform is beneficial for lenders.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Features like number of inquiries, bank card utilization, number of deliquent accounts, number of recently opened trade, debt to income ratio, stated monthly income seems to be variables that could determine the prosper score.

I also want to determine the difference between the estimated return and the annualized return. This will show how much in line the predictions of prosper is for the estimated return. According to Investopedia “An annualized total return is the geometric average amount of money earned by an investment each year over a given time period. It is calculated as a geometric average to show what an investor would earn over a period of time if the annual return was compounded.”

Did you create any new variables from existing variables in the dataset?

Yes. In the univariate section I added the year, the listing originated. I used lubridate package to get the year from the date.

I also have calculated the following variables in the multivariate section to see the returns of the records in the dataset:

  • invested_amount=LoanOriginalAmount-LP_ServiceFees-LP_CollectionFees
  • return_obtained=(LP_CustomerPayments-invested_amount)/invested_amount
  • annualized_return=(1 + return_obtained)^(12/Term) - 1

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

When I plotted estimated return. I found the range to be -0.183 to 0.284. It was odd to see a negative return. So then I filtered out data based on estimated return less or equal to zero. The data showed that majority of borrowers(183/195) of this group belong to the high risk group(HR) of prosper rating system. We can see that the total inquiries is quiet lot with median value of 8 compared to a 4 for the whole data set. In terms of Bank card utilization we can see a median value of 0.78 compared to the 0.6 value for the whole data set. These might be the reasons of a negative return.

For the dataset I had to change the ListingCreationDate variable from factor to date. I also separated out the year in a variable called ListingCreationYear for further analysis.

Later I have also created Invested_amount which is the amount that the investers spent on a loan. And returned_amount, which is the amount borrowers have returned. We used this to calculate the annualized_return which is the amount of money earned by an investment each year over a given time period. We compare this with the estimated return to see how much is the difference between estimated and actual return.

Bivariate Plots Section

Let’s see what the estimated loss rate is for each prosper rating group.

We can see the range of the estimated loss rate for each rating group in the plot above. AA has the lowest estimated loss rate and with each rating the loss rate increased.

Let’s see the effect of the two variables, prosper score and credit score on the Estimated Loss.

We see as the prosper score increased the estimated loss decreased.

As before we can see that as the credit scores increased the Estimated Loss decreased or the prosperratings improved. Here we used credit score range lower to keep things simple.

Linear Model of Estimated Loss with Prosper Score and CreditScore

## 
## Calls:
## m1: lm(formula = EstimatedLoss ~ CreditScoreRangeLower, data = prosper_loan_data)
## m2: lm(formula = EstimatedLoss ~ CreditScoreRangeLower + ProsperScore, 
##     data = prosper_loan_data)
## 
## =========================================================
##                                m1              m2        
## ---------------------------------------------------------
##   (Intercept)                  0.435***        0.357***  
##                               (0.002)         (0.002)    
##   CreditScoreRangeLower       -0.001***       -0.000***  
##                               (0.000)         (0.000)    
##   ProsperScore                                -0.011***  
##                                               (0.000)    
## ---------------------------------------------------------
##   R-squared                    0.261           0.534     
##   adj. R-squared               0.261           0.534     
##   sigma                        0.040           0.032     
##   F                        30028.249       48530.965     
##   p                            0.000           0.000     
##   Log-likelihood          152329.093      171830.556     
##   Deviance                   137.054          86.550     
##   AIC                    -304652.185     -343653.112     
##   BIC                    -304624.139     -343615.717     
##   N                        84853           84853         
## =========================================================

We create this linear model as prosper calculates the estimated loss using prosper score and credit score. As can be seen from the R^2 value, the Credit score can explain 26.1% of the variance in estimated Loss, while both Credit score and prosper score combined can explain 53.4% of the variance. It is evident that other factors contribute to rest of the variance observed.

We would now like to know which variables contribute to the prosper score.

Analysis of variables that could contribute to prosper score:

Variables used for the ggmatrix-ProsperRating..Alpha., ProsperScore, CreditScoreRangeLower, InquiriesLast6Months, TotalInquiries, Delinquencies Last7Years, BankcardUtilization, TradesOpenedLast6Months, DebtToIncomeRatio

I choose a few variables that could have an impact on the ProsperScore. We can see that as prosper score increases number of inquires in last 6 months, total inquiries, the number of trades opened last 6 months as well as Bankcard utilization decreases.

We see a negative relationship between Delinquencies last 7 years with inquiries in last 6 months. More inquiries would mean they need a loan. So these borrowers would try their best to avoid delinquency as it would make it difficult to get a loan.

We see a negative relationship between delinquencies of 7 years and trades opened last 6 months.So it is important to have good standing in delinquencies for opening multiple trades.

We see that as the DebtToIncome Ratio decreases, trades opened in last 6 months increases. It could be that the trades have produced income to decrease the ratio. We will check this later.

We see that as the DebtToIncome Ratio decreases the no of inquiries increases. So borrowers wanting more loan tend to be more vigilant about their spending in terms of their Income.

Prosper Score vs Inquiries:

Here we plot both total inquiries and inquiries last 6 months, to see the effect of total and recent inquiries on prosper score. We see that as the number of inquiries increase the Prosper Score falls. So number of inquiries could be an important variable for determining prosperscore.

Here I have plotted the prosper score as factor on the x-axis. I felt plotting the prosper score on the X-axis shows the relationship much more clearer.

Now to look at the range of the inquiries for each rating group we need a box plot. Using the categorical variable Prosper Ratings lets see the ranges of the inquiry variable for each rating group.

For the boxplot we limit our y axis till 20. For better visualizations. As the number of inquiries increase the prosper score decreases. We see most inquiries are 8 or less. Better score borrowers are much more financially sufficient and thus have less number of inquiries.

Prosper Score vs Delinquencies:

For delinquencies we see some decrease in number with increase of prosper score especially for the higher scores 9-11. For better view we have limited our y-axis to 10. We will check out the box plot for a better grasp.

Although the median is the same for all prosper score, we can see as the score go from worse to good we see the number of delinquencies decreasing. It’s more visible in case of the last 7 years.Most of the borrowers have 5 or less Delinquencies in the last 7 years. Better prosper score borrowers tend to pay their bills on time.For the box plots I have limited the y-axis to 20.

Prosper Score vs AvailableBankcardCredit:

More bank card credit is available for better prosper score borrowers. As they can pay off their credit card debt on time, and thus have more available bank card credit.We see most borrowers having 50000 orless availanle bank card credit.

Prosper Score vs BankcardUtilization:

From the median values of the box plot we can see that as Bank card utilization decreases the prosper score gets better.Other than the outliers, Bank card utilization is less than 1. 1 means utilization of all credit card amount.Borrwers of better prosper score tend to pay off card balance better than the lower score borrowers.

Prosper Score vs Trades opened:

As number of trades decrease, the prosper score gets better. We have jittered the plot for better visualization in overplotting.

From the box plots also we see the same trend. Most total trades are below 38 total trades. For the more recent trades we see numbers below 2.5.

Prosper Score vs debttoincomeratio:

We see that as debt to income ratio decrease, the prosper score gets better. So borrowers with better prosper score tend to have less debt and more income and thus likely to pay off their debt. Here for bott the scatter plot and box plot we have zoomed in on 0-2.5 on the y-axis to better visualize the trend.

Prosper Score vs statedmonthlyincome:

We see that prosper score is related to stated monthly income. We had to use jitter and alpha to see the trend amongst overplotting in case of overplotting. Borrowers having higher monthly income have a better prosper score.The box plot show that most of the stated monthly income is 12000 or less.

Now we will look at other bivariate relationships.

The frequency of each prosper rating group per loan status group:

We have plotted the fraction of each rating group foreach of the loan status group. We can see from the above frequency plot that D, E and HR groups make the most of chargedoff and defaulted borrowers. In the past due groups C, D and E dominates.

Next using the following variables we observe their effect mainly on the APR: BorrowerAPR, EstimatedLoss, EstimatedReturn, ProsperRating..numeric., EmploymentStatusDuration, CreditScoreRangeLower, DebtToIncomeRatio, StatedMonthlyIncome, TotalProsperLoans, TotalProsperPaymentsBilled, OnTimeProsperPayments, ProsperPrincipalOutstanding, LoanOriginalAmount.

From the above ggmatrix, we can see when the BorrowerAPR increases the estimated loss and estimated return increases. The APR also increases as ProsperRatings move from AA to HR.The Borrowers that have high estimated loss also has high estimated return. So they have a high APR which will lead to high returns if they pay off and high loss if don’t pay.

The proportion of each rating group per year from August 2009:

We wanted to see if over the years any prosper rating group is favored over others. From the plot we see that “A” is the most popular group in 2009. D" seems to be the most popular group of borrowers for 2010-2011. HR in “2012”, and “C” in 2013-2014. Although we just have 2 months of data for 2014. So that could change. So we can see that the higher risk grouos are actually favored for investing.

If the proportion of bad(chargedoff/ defaulted) borrowers changed with time:

It will also tell us if prospers new system is working or not in reducing chargeoffs and defaults. We plot the percent of bad(chargeoff, defaulted), good(completed) and current borrowers for each year to see the change.

Percent of the different group of borrowers per year:

## # A tibble: 10 x 4
##    ListingCreationYear total_bad total_good total_current
##                  <dbl>     <dbl>      <dbl>         <dbl>
##  1                2005   0          1             0      
##  2                2006   0.396      0.604         0      
##  3                2007   0.391      0.609         0      
##  4                2008   0.328      0.672         0      
##  5                2009   0.152      0.848         0      
##  6                2010   0.167      0.826         0.00542
##  7                2011   0.191      0.482         0.297  
##  8                2012   0.132      0.279         0.541  
##  9                2013   0.00895    0.0629        0.901  
## 10                2014   0          0.00578       0.992

We can see before the new system was in place the percentage of loss was pretty high (chargedoff, defaulted), Only from 2009 it became low. We have current borrowers still paying from 2010 and onwards. Their number increases as the year increases. From the data it can be seen that 2010 was close to 2009 in terms of bad and good borrowers. 2011 has about 30% current borrowers but their loss is already 19.1%. 2011 and onwards there is a lot of current borrowers. So we cannot really conclude how much the loss would be. May be when more data is available we can make a model of previous years to predict the loss.

Loan amount vs term:

Lets now see if the loan amount depends on the number of months for which it is sanctioned. From the box plot, We can see as the loan amount increased, the borrowers were given more time to pay it back.

Is there a relationship between loan amount and prosper rating:

From the boxplot We see as the prosper rating goes from better to worse the loan amounts decrease significantly. Especially E-HR group mostly can secure loan amounts of 5000 or less. D can mostly secure loan of 10000 or less. Whereas for rest of the groups 15000-16000 is a common loan amount.

BorrowerAPR vs Prosper Rating:

From the boxplot we found the BorrowerAPR to be strongly related to the prosper ratings. The APR value is lower for AA and kept increasing as we moved through the consecutive risky ratings.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

From the linear model of Estimated loss we found both credit score and prosper score explaining about 53.4 % of variance.

We found that variables StatedMonthlyIncome, DebttoIcome ratio, TradesOpenedLast6Monsths, AvailableBankcardcredit, BankCardutilization DelinquenciesLast7years, TotalInquiries could influence the prosper score. For the ease of use of box plots (to see the range, IQR and medians), I transformed the prosper score to factor to see the relationships with other variables.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

It looks like A-D are the most popular group for lending.

To determine loan performance I found out the frequency of each prosper rating group in each of the loan Status groups. Specifically I was interested in seeing the distribution of the rating in chargedoff and defaulted group. We found D, E and HR borrowers more than any other group in the charged off and defaulted section.

As expected we found the BorrowerAPR to be strongly related to the prosper ratings. The APR value is lower for AA and kept increasing as we moved through the consecutive risky ratings.

I found that the loan amount sanctioned depends on the prosper rating as well. With 75% or more of the borrowers belonging to D, E and HR group receiving loans less than 10000.

What was the strongest relationship you found?

From the figures, we can say that the available Bank credit and Debt to income ratio have a strong relationship to prosper score than the other variables tested.

Multivariate Plots Section

Debt to income ratio vs trades opened in last 6 months:

As observed in the previous section we saw that as the DebtToIncome Ratio decreases, trades opened in last 6 months increases. Our assumption was that more trades produce more income and hence the ratio decreases.

We plotted the debt to income ratio against trades opened in last 6 months faceted by income range. We see higher income range group have more trades opened in last 6 months. At first glance it feels that the debt to income ratio decreases as number of trades opened increases. But when we add a smooth curve to the plots we see that that’s not the case. The ratio remains almost constant with the change of the number of trades opened in the last 6 month, irrespective of the income range. We have limited the x-axis and y-axis to get better visualization.

How does ProsperScore and Credit Score dictate the ProsperRatings:

This plot show how the prosper rating is influenced by prosper score and credit score. High values of both score means AA rating and low values means HR. The rest falls in between with some overlapping. Thus there is other factors that contribute to the final Prosper rating.

How does the prosperrating align with estimated loss vs estimated return:

We want to see how estimated loss and estimated return together influence the prosper rating. We can see for the most part as the loss increases return increases and vice versa. From the smooth curve we can see that the estimated return peaks at ProsperRating E, so the Estimated return is higher at E than any other group. D and HR seems to have similar returns.

Linear Model of ProsperScore:

Based on the relationships we observed in the previous section we decided to construct a linear model of prosper score with TotalInquiries, DelinquenciesLast7Years, AvailableBankcardcredit, TradesOpenedLast6Months, DebtToIncomeRatio and StatedMonthlyIncome.

## 
## Calls:
## m1: lm(formula = ProsperScore ~ TotalInquiries, data = prosper_loan_data)
## m2: lm(formula = ProsperScore ~ TotalInquiries + DelinquenciesLast7Years, 
##     data = prosper_loan_data)
## m3: lm(formula = ProsperScore ~ TotalInquiries + DelinquenciesLast7Years + 
##     AvailableBankcardCredit, data = prosper_loan_data)
## m4: lm(formula = ProsperScore ~ TotalInquiries + DelinquenciesLast7Years + 
##     AvailableBankcardCredit + TradesOpenedLast6Months, data = prosper_loan_data)
## m5: lm(formula = ProsperScore ~ TotalInquiries + DelinquenciesLast7Years + 
##     AvailableBankcardCredit + TradesOpenedLast6Months + DebtToIncomeRatio, 
##     data = prosper_loan_data)
## m6: lm(formula = ProsperScore ~ TotalInquiries + DelinquenciesLast7Years + 
##     AvailableBankcardCredit + TradesOpenedLast6Months + DebtToIncomeRatio + 
##     StatedMonthlyIncome, data = prosper_loan_data)
## 
## ===========================================================================================================================
##                                  m1              m2              m3              m4              m5              m6        
## ---------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                    6.525***        6.581***        6.105***        6.321***        6.772***        6.474***  
##                                 (0.012)         (0.012)         (0.012)         (0.013)         (0.014)         (0.016)    
##   TotalInquiries                -0.134***       -0.130***       -0.136***       -0.111***       -0.115***       -0.123***  
##                                 (0.002)         (0.002)         (0.002)         (0.002)         (0.002)         (0.002)    
##   DelinquenciesLast7Years                       -0.020***       -0.010***       -0.008***       -0.011***       -0.011***  
##                                                 (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   AvailableBankcardCredit                                        0.000***        0.000***        0.000***        0.000***  
##                                                                 (0.000)         (0.000)         (0.000)         (0.000)    
##   TradesOpenedLast6Months                                                       -0.470***       -0.503***       -0.504***  
##                                                                                 (0.008)         (0.008)         (0.008)    
##   DebtToIncomeRatio                                                                             -1.145***       -1.023***  
##                                                                                                 (0.023)         (0.023)    
##   StatedMonthlyIncome                                                                                            0.000***  
##                                                                                                                 (0.000)    
## ---------------------------------------------------------------------------------------------------------------------------
##   R-squared                      0.047           0.053           0.152           0.189           0.238           0.251     
##   adj. R-squared                 0.047           0.053           0.152           0.189           0.238           0.251     
##   sigma                          2.321           2.313           2.188           2.140           2.062           2.045     
##   F                           4143.125        2369.708        5072.617        4937.739        4840.937        4323.506     
##   p                              0.000           0.000           0.000           0.000           0.000           0.000     
##   Log-likelihood           -191829.302     -191545.952     -186853.027     -184973.240     -166186.919     -165530.663     
##   Deviance                  456913.164      453871.785      406344.650      388733.796      329863.463      324328.093     
##   AIC                       383664.603      383099.903      373716.053      369958.479      332387.838      331077.327     
##   BIC                       383692.650      383137.298      373762.797      370014.571      332452.649      331151.397     
##   N                          84853           84853           84853           84853           77557           77557         
## ===========================================================================================================================

From our linear model we can see that 25.1% of the variance in the prosper score can be explained by the tested variables. The p value is <0.05. So the result is significant. The added variables have made the model fit better. From the amount of variance explained by this model, we can say that there are definitely other factors that contribute to the prosper score

The difference between estimated and annualized return:

We need to calculate the annualized return to estimate how much annual return the investor is getting. We first calculate the invested amount by adding the loan original amount with service fees and collection fees. “-” sign is used to add service and collection fees as they are listed as “-”" value.Then the cummulative return (return obtained) and the annualized return are calculated. ref

We plotted the mean of the estimated and anualized return per day against each listing creation day of the year. We see that the mean annualized daily return is less than the estimated return. What is interesting is that the annualized return was close to the estimated return in earlier years and the difference seems to have increased over time. This is opposite of what I had expected.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

We saw here how the prosper score and credit score determines the prosper ratings. When both are good we get the best rating AA and we get HR when both are bad. The rest falls in between.

we created a linear model of variables that shape the prosper score.

we saw the difference between the estimated return and the actual return. Actual return was close to estimated in earlier years and the difference seems to have increased over time. From the plot we can also say that prosper estimates higher return compared to the actual return that investors get.

Were there any interesting or surprising interactions between features?

Yes. I found it really interesting that the difference between the estimated and annualized returns seem to have increased over time. This is opposite of what I had expected. As companies are looking for ways to improve so they polish their methods to better estimate the actual values.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

In the last part we were trying to identify some of the key variables that impact the prosper score. We found that variables StatedMonthlyIncome, DebttoIcome ratio, TradesOpenedLast6Monsths, AvailableBankcardcredit, DelinquenciesLast7years, TotalInquiries could answer 25% variance observed in prosper score.

Final Plots and Summary

Plot One

Description One

We did this plot to determine how the Prosper rating groups are doing when paying off their loan. We see that D, E and HR groups make the most of chargedoff and defaulted borrowers. In the past due groups C, D and E dominates.

Plot Two

Description Two

We can see here that the prosper rating of the borrower influences the amount of loan they get. AA has a higher median loan amount than the rest. D, E and HR have the lowest median amounts. Interestingly the variation in the loan amount decreased as the ratings moved from AA to HR. AA to C has somewhat similar variance with each other. But E and HR variance is really low. So although high risk group are estimated to have higher return, prosper have checks in place to reduce the effect of loss of the investors.

Plot Three

Description Three

In the above plot, We have plotted the mean estimated return per day and the mean annualized return per day against the ListingCreationDate. We see that the mean annualized daily returns is less than the estimated return. What is interesting is that the annualized return was close to the estimated return in earlier years and the difference seems to have increased over time. Also we see the return being estimated higher than it actually is.

Reflection

This dataset had 113937 records and 81 variables. So I found it overwhelming at first. Also I had difficulty understanding the variables. Although there definition were provided I still had to research to understand their meanings. I found this a good opportunity to work on a peer to peer lending company. I had to reasearch on how prosper works, to decide on the variables that I wanted to finally work with.

The data set had records from 2005 to 2014. I focused on records after July 2009 as that was when prosper came up with significant changes to their processes. From my research I found that prosper ratings allow investors to consider a loan’s level of risk because the rating represents an estimated average annualized loss rate range. I found that it is calculated mainly using credit score and prosper score. The prosper score is also calculated by prosper based on the historical prosper data. So I wanted to know which are the key variables that influence prosper score. I made a linear model with the prosper score as the response variable and a number of predictor variables. The predictor variables selected, were based on my research on how prosper works. I got a weak model that could explain 25% of the variation observed in prosper score. So further work needs to be done to improve the model with other variables or transformations of the existing variables. So that we can better predict the prosper score given the charateristics of the borrower.

I was also interested in determining how much the investors are actually getting back from their investment in prosper compared to their estimated returns. I determined the annualized return from the invested amounts and the paid back amounts of the borrowers. Once plotted I saw that prosper overestimates it returns. Also in earlier years the estimated return was actually closer to the annualized return and later these diverged more. So this needs to be investigated further to determine if there was a flaw in my calculations. As one would expect the platform to be better over time.